Abstract

We study the prevalent problem when a test distribution differs from the training
distribution. We consider a setting where our training set consists of a small number of
sample domains, but where we have many samples in each domain. Our goal is to
generalize to a new domain. For example, we may want to learn a similarity function
using only certain classes of objects, but we desire that this similarity function be
applicable to object classes not present in our training sample (e.g. we might seek to
learn that "dogs are similar to dogs" even though images of dogs were absent from our
training set). Our theoretical analysis shows that we can select many more features
than domains while avoiding overfitting by utilizing data-dependent variance
properties. We present a greedy feature selection algorithm based on using T-statistics. Our
experiments validate this theory, showing that our T-statistic based greedy feature
selection is more robust at avoiding overfitting than the classical greedy procedure.



1 Introduction 



The generalization ability of most modern machine learning algorithms is predicated on the
assumption that the distribution over training examples (roughly) matches the distribution over the test
data. There is a growing literature studying settings where this implicit assumption fails to hold,
often referred to as domain adaptation or transfer learning. This problem is central in fields such as
speech recognition [19], computational biology [20], natural language processing [5, 8, 12], and web
search [6, 11].

We examine how severe this problem can be, even on one of the most conventional benchmark datasets,
the MNIST digits dataset. Here, state-of-the-art algorithms reliably obtain classification error rates
below 1% when recognizing one digit vs. the other digits. Consider a natural modification of this
setting where we train a model to recognize the digit "2" vs. the other even digits. If we learn to
recognize a "2" accurately (vs. only even digits), then we may hope that our classifier will robustly
recognize a "2" against new odd digits. Unfortunately, this is far from being true: a logistic regression
algorithm, trained on this dataset and achieving a (true) test error rate of about 0.5% (against even
digits), jumps to a 35% error rate when tested vs. odd digits, a startling 70-fold (roughly 7000%)
increase in error. While the present work uses deep belief network features [13], trained on unlabeled
data, this situation is generic across many other common training methods we have tried: SVMs with
various kernels and logistic/linear regression with various feature choices (where error rates increase
from hundreds to thousands of percent depending on the details of the experiment).

We elucidate this overfitting issue by examining how various "area under the ROC curves" change as
we greedily add more features. Here, we train our model to recognize the digit "2" vs. eight other
digits, and test recognition of a "2" vs. the remaining digit (with balanced distributions where a "2"
appears half the time in both the training and test distributions). The first four plots in Figure 1 show
the test performance using the area under the ROC metric on the training distribution (the dashed red
curve) and on the test distribution (the solid red curve) vs. the number of features we have greedily
added. Note two striking effects: 1) how rapidly the test performance plummets; 2) more troubling,
how quickly the test performance for the training and test distributions diverges. In particular, note that
the true generalization performance on the source training distribution is not at all reflective of the true
generalization performance on the target test distribution, even after adding just a few features: a classic
example of overfitting. The final plot shows the average of the training and test performance, averaged
over which digit is held out, cycling through the digits.

Figure 1: Area under ROC (AUROC) (y-axis) for predicting digit '2' using the greedy algorithm. The x-axis
shows the order in which variables were picked out of 2000 total variables. The top dashed line shows
generalization or 'test' performance on the source domain. The transfer to a new domain is shown below it as
the solid line. The last figure on the right shows the average of all the ROC curves. The horizontal line is
at "chance."

Overfitting is to be expected, because this experiment violates the learning theoretic preconditions for
successful generalization. Furthermore, for this particular experiment, we could argue that a generative
approach is more robust: if we have a model for generating a "2", then it should be good for recognition
in diverse settings. While the generative framework is promising, particularly for generating predictive
features, it is often difficult to specify good generative models.

In this work, we assume a distribution over domains, and that our training sample consists of a small 
number of sample domains independently drawn from the distribution over domains and where we have 
access to many samples in each domain. The goal in our setting is to perform well on new domains 
sampled from this distribution. For example, in the previous experiments, we can consider that we 
have eight sample (known) domains in our training set, where domains are of the form "2 vs. 0", "2 
vs. 1", "2 vs. 3", etc. This is much like the standard supervised learning model, except that sampled 
"points" are now "domains". The challenge is that we desire to avoid overfitting with an extremely 
small number of domains, or few "samples", compared to standard supervised learning paradigms, that 
assume hundreds to millions of samples. 

The problem of domain adaptation is more general than this particular formulation, where our focus
is on how to do well on a new random domain. There are numerous different aspects of the domain
adaptation problem that have been studied. For example, assumptions considered are: when the classes
are "imbalanced" (e.g. when $\Pr[Y \mid D]$ could vary with the domain $D$); "covariate shift" [4], where
$\Pr[X \mid D]$ varies with the domain $D$ while $\Pr[Y \mid X, D]$ is not a function of the domain; settings
where, under a change of representation, the joint distributions $\Pr[(X, Y) \mid D]$ are more similar
[12, 14, 16]; and settings where one desires mixtures of predictors which adapt to each domain [9]. A
detailed discussion of these models is beyond the scope of this paper (see [17] for a more comprehensive
review). There is also a growing body of theoretical work on this problem, including [15, 21], that
concentrates on either characterizing the degradation that can occur due to distributional shift (e.g. [2])
or robustly training using biased sampling, such as the sample selection bias work of [7].

Our work differs in that we assume a distribution over domains, and our focus is on generalization to
new domains. One interesting application of this work is learning similarity functions. For example,
we may desire to learn a similarity function for objects, where objects of the same label have high
similarity, in a manner that lets us utilize this similarity function to recognize new objects not
present in our training set; a problem known as "zero-shot" learning.

Our Contributions: Our analysis focuses on the issue of overfitting, and we borrow the idea from
small sample statistics that a certain empirical variance should be utilized when deciding whether or
not an effect is significant, namely, that an added feature will decrease our error. We do this using
T-statistics. The key idea is that we can estimate the weight of each feature on each training domain
separately. Indeed, if this weight varies wildly over the training source domains, then even though
this feature may be useful on all our source domains, its potential for generalization to new domains
may be poor. We show that our data-dependent version of feature selection robustly enjoys the usual
feature selection properties, i.e. we can select many more features than domains, particularly if certain
data-dependent variances are low, under relatively weak assumptions.

The contributions of this work are as follows: 

• Using small sample statistics, namely that of T-tests, we provide a more robust procedure to
add features, which takes into account data-dependent properties.

• Using the theory of large deviations for self-normalized sums, we show that we can robustly
add many more features than domains (exponentially more), utilizing certain empirical
variances.

• We empirically demonstrate that we control for overfitting using an alternative greedy
procedure for feature addition, based on the T-statistic. In particular, we show that these ideas can
be utilized towards the theory of "zero-shot learning".

2 Setting 

A key idea in our setting is that we consider a distribution over domains, which we denote by $\Pr[D]$
(it is possible that there may be an infinite number of domains). Conditioned on a domain $D = d$, the
distribution over input/output pairs is $\Pr[(X, Y) \mid D = d]$, where our inputs are $X \in \mathbb{R}^p$.
As is standard, these inputs could represent a high dimensional feature space. The goal is to find a weight
vector which minimizes the squared error, averaged over both instances and over the domains. More
precisely, the error we want to minimize is:

$$\mathcal{L}(w) = \mathbb{E}_D \, \mathbb{E}_{X,Y}\left[(Y - w \cdot X)^2 \mid D\right]$$

where the inner and outer expectations are over $(X, Y)$ and $D$, respectively.
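As an illustration of this objective, the per-domain squared error expands into second-order statistics, $\mathbb{E}[Y^2 \mid d] - 2\, w \cdot \mathbb{E}[XY \mid d] + w^\top \mathbb{E}[XX^\top \mid d]\, w$, so it can be evaluated without revisiting individual samples. The sketch below averages this quantity over a handful of training domains. The function names and the tuple layout `(eyy, exy, exx)` are hypothetical, and access to $\mathbb{E}[Y^2 \mid d]$ is an extra assumption made only for this example.

```python
def domain_error(w, eyy, exy, exx):
    """E[(Y - w.X)^2 | D = d] computed from second-order statistics.

    eyy: E[Y^2 | d] (assumed available for this illustration)
    exy: list with entries E[X_i Y | d]
    exx: nested list with entries E[X_i X_j | d]
    """
    p = len(w)
    linear = sum(w[i] * exy[i] for i in range(p))
    quadratic = sum(w[i] * exx[i][j] * w[j] for i in range(p) for j in range(p))
    return eyy - 2.0 * linear + quadratic


def average_error(w, domains):
    """Empirical version of the objective: the mean error over training domains."""
    return sum(domain_error(w, *stats) for stats in domains) / len(domains)


# Two toy one-feature domains, each with E[Y^2 | d] = 1 and E[X^2 | d] = 1.
toy_domains = [
    (1.0, [0.5], [[1.0]]),
    (1.0, [0.3], [[1.0]]),
]
null_error = average_error([0.0], toy_domains)   # always predicting 0
fitted_error = average_error([0.4], toy_domains)  # weight set to the mean covariance
```

On this toy input the null prediction has average error 1, and using the mean covariance 0.4 as the weight lowers it.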

Our training set consists of a set of $n$ known domains $\{d_1, d_2, \ldots, d_n\}$, where each domain is
sampled independently. In practice, while $n$ is small, we often have a large number of samples in each
domain, so that the second order statistics can be estimated accurately on each training domain. As a
natural abstraction, we assume that for each domain $d$ in our training set, we have knowledge of both
$\mathbb{E}[XY \mid D = d]$ and $\mathbb{E}[XX^\top \mid D = d]$.

For our theoretical analysis, we also assume the joint input covariance matrix $\mathbb{E}[XX^\top]$ is
known, as it can be estimated accurately with unlabeled data. This permits a cleaner exposition in terms
of unbiased estimation, although this distinction is relatively minor in practice.

3 Feature Selection and Small Sample Statistics 

Our goal is to avoid overfitting while adding features: we desire confidence that our added feature
actually improves the error on new domains. The naive greedy method is to add the feature which maximally
decreases our training set error, which, as we have shown in the Introduction, can perform remarkably
poorly. Instead, we provide a theory which more sharply characterizes when adding a feature actually
improves our performance.

3.1 Adding a Single Feature 

We first investigate the question of whether or not a single feature improves the null prediction of always
saying 0. It is natural to base our theory on unbiased estimates, as we often have the most robust
statistical tests for these estimates.

Consider a feature $X_i$, which is normalized so that $\mathbb{E}[X_i^2] = 1$. The optimal weight on this
feature is $w_i^* = \mathbb{E}[X_i Y]$. Furthermore, any weight $w_i$ on $X_i$ has regret:

$$\mathcal{L}(w_i) - \mathcal{L}(w_i^*) = \left(w_i - \mathbb{E}[X_i Y]\right)^2$$

Hence, with respect to adding just one feature, our task is to find a feature $X_i$ and weight $w_i$
such that we have confidence that $w_i$ is closer to $\mathbb{E}[X_i Y]$ than $0$ is (as weight $0$
corresponds to the null prediction). The natural unbiased estimate for $w_i^*$ is simply:

$$\hat\mu_i = \frac{1}{n} \sum_{k=1}^{n} \mathbb{E}[X_i Y \mid d_k]$$

The Central Limit Theorem implies that $\hat\mu_i$ should be close to $\mathbb{E}[X_i Y]$ on the order of
$O(\sigma(X_i Y)/\sqrt{n})$, where $\sigma(X_i Y)$ is the standard deviation. A key idea in small sample
statistics is to take into account the empirical variance. Here, when determining if $X_i$ is useful, we
seek to consider the (unbiased) variance estimate:

$$\hat\sigma_i^2 = \frac{1}{n-1} \sum_{k=1}^{n} \left(\mathbb{E}[X_i Y \mid d_k] - \hat\mu_i\right)^2$$

and the issue is how to utilize this estimate rather than the true variance.
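To make the role of the across-domain variance concrete, here is a small sketch that computes the unbiased mean and variance of a feature's per-domain covariances $\mathbb{E}[X_i Y \mid d_k]$. The function name and the toy numbers are hypothetical; they are chosen so that two features share the same mean covariance but differ sharply in how consistent they are from domain to domain.

```python
def domain_mean_and_variance(per_domain_cov):
    """Unbiased mean and variance of E[X_i Y | d_k] across n training domains."""
    n = len(per_domain_cov)
    if n < 2:
        raise ValueError("need at least two domains for a variance estimate")
    mu_hat = sum(per_domain_cov) / n
    var_hat = sum((c - mu_hat) ** 2 for c in per_domain_cov) / (n - 1)
    return mu_hat, var_hat


# Hypothetical covariances over five domains: same mean, very different spread.
stable = [0.30, 0.32, 0.28, 0.31, 0.29]
erratic = [0.9, -0.3, 0.6, -0.2, 0.5]

mu_stable, var_stable = domain_mean_and_variance(stable)
mu_erratic, var_erratic = domain_mean_and_variance(erratic)
```

Both features have mean covariance 0.3, yet the erratic feature's variance estimate is roughly a thousand times larger, which is exactly the information an estimate based only on the mean would discard.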

In our domain adaptation setting, it may be the case that for certain "robust" features this covariance
$\mathbb{E}[X_i Y]$ is more consistently correlated with the target; it is these features that we seek to
add. By contrast, "large" sample analysis typically involves only using an upper bound on the standard
deviation $\sigma(X_i Y)$, along with tail bounds such as the Bernstein bound [3], to get estimates on the
deviation between $\hat\mu_i$ and its mean. However, crucially, as $\sigma(X_i Y)$ could vary greatly with
our feature $X_i$, we desire a sharper estimate which takes into account the empirical variance,
$\hat\sigma_i^2$.

If $\hat\mu_i$ followed a normal distribution, then this question reduces to a Student's T-test. Here, the
T-statistic is:

$$T_i = \frac{\hat\mu_i}{\hat\sigma_i / \sqrt{n}}$$

While we do not expect the $\hat\mu_i$ to actually follow a normal distribution, there is a rather large
literature showing that the T-test is robust (see for example [10]). We now demonstrate this point under a
milder assumption, that $\hat\mu_i$ is symmetric (where the source of randomness is from a random domain).
Equivalently, this is an assumption that the covariances $\mathbb{E}[X_i Y \mid d]$ are symmetric about
their mean (i.e. both $\mathbb{E}[X_i Y \mid d] - \mathbb{E}[X_i Y]$ and
$-(\mathbb{E}[X_i Y \mid d] - \mathbb{E}[X_i Y])$ have the same distribution, where $d$ is the source
of randomness). The following theorem assumes no moment conditions on $X_i$ or $Y$ (not even upper
bounds). It shows that we can accurately test an exponential number of features with high confidence.
This bound has similar behavior to the T-distribution (for fixed $n$) as we scale the number of features.

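Theorem 1. Assume the random vector $\mathbb{E}[XY \mid d] - \mathbb{E}[XY]$ is symmetric (where $d$ is
random). Let $\delta > 0$. Suppose $\mathcal{F}$ is a set of features whose size satisfies
$|\mathcal{F}| \le \frac{\delta}{2} e^{n/8}$ (i.e. it is of size at most exponential in $n$). Then for all
$X_i$ in $\mathcal{F}$, we have with probability greater than $1 - \delta$:

$$\left|\hat\mu_i - \mathbb{E}[X_i Y]\right| \le \frac{\hat\sigma_i}{\sqrt{n}} \sqrt{4 \log \frac{2|\mathcal{F}|}{\delta}}$$

where no moment bounds on $X$ and $Y$ are assumed, aside from existence of $\mathbb{E}[XY \mid d]$ and $\mathbb{E}[XY]$.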
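The following sketch evaluates the confidence radius of Theorem 1 numerically. It reads the bound as $(\hat\sigma_i/\sqrt{n})\sqrt{4\log(2|\mathcal{F}|/\delta)}$ with the size condition $|\mathcal{F}| \le \frac{\delta}{2}e^{n/8}$, which is the form the appendix proof works with (it chooses $t^2 = 4\log(2|\mathcal{F}|/\delta)$ and requires $t^2 \le n/2$). The function names are hypothetical.

```python
import math


def max_feature_count(n, delta):
    """Largest |F| allowed by the size condition |F| <= (delta / 2) * exp(n / 8)."""
    return (delta / 2.0) * math.exp(n / 8.0)


def confidence_radius(sigma_hat, n, num_features, delta):
    """Radius (sigma_hat / sqrt(n)) * sqrt(4 log(2 |F| / delta)) from Theorem 1."""
    if num_features > max_feature_count(n, delta):
        raise ValueError("feature set too large for the size condition")
    return sigma_hat / math.sqrt(n) * math.sqrt(4.0 * math.log(2.0 * num_features / delta))


# With n = 80 domains and delta = 0.1, about a thousand features can be tested at once.
limit = max_feature_count(80, 0.1)
radius = confidence_radius(sigma_hat=1.0, n=80, num_features=1000, delta=0.1)
```

Note how the radius scales with the empirical standard deviation of each feature: a feature whose covariance is ten times more stable across domains gets a ten times tighter interval at the same confidence level.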

The proof of this theorem is in the appendix. The key is that this theorem shows that the empirical
variance can be taken into account when searching through a large feature set. In fact, asymptotically,
as implied by the Central Limit Theorem, the only improvement possible is that the constant of 4 would
become a 2.

The proof (provided in the Appendix) of this bound is significantly more subtle than the standard
"Bernstein"-like bounds, since the T-statistic has much "thicker" tails. Our proof is based on the
following bound for "self-normalized" sums, which, to our knowledge, has not been utilized in the
machine learning literature.

Theorem 2. (See Theorem 2.15 in [10]) Assume $Z_1, \ldots, Z_n$ are independent, mean 0, symmetric
random variables. For all $t > 0$, the following bound on the self-normalized sum holds:

$$\Pr\left[\frac{\left|\sum_{i=1}^{n} Z_i\right|}{\sqrt{\sum_{i=1}^{n} Z_i^2}} > t\right] \le 2\exp\left(-\frac{t^2}{2}\right)$$

where no moment bounds on $Z_i$ are assumed, aside from its mean existing.

For completeness, we add the proof of this theorem in the appendix. It is based on a simple symmetrization
argument along with Hoeffding's tail inequality. Note that the above bound is not quite a large deviation
bound for a T-statistic, as the denominator uses $\sum_i Z_i^2$, while a T-statistic would have a term of the
form $\sum_i (Z_i - \bar{Z})^2$, where $\bar{Z}$ is the empirical estimate of the mean, $\sum_{i=1}^{n} Z_i / n$.
This subtlety leads to the condition in Theorem 1 that the size of $\mathcal{F}$ is at most exponential in $n$.
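A quick Monte Carlo sanity check can illustrate that the self-normalized bound needs no moment conditions. The sketch below draws standard Cauchy variables (symmetric, with no finite variance), forms the self-normalized sum, and compares the empirical two-sided tail frequency with $2\exp(-t^2/2)$. The choice of distribution, sample sizes, and seed are assumptions made for the example; it is a simulation, not a proof.

```python
import math
import random


def self_normalized_sum(zs):
    """Sum of the values divided by the square root of the sum of their squares."""
    return sum(zs) / math.sqrt(sum(z * z for z in zs))


def tail_frequency(n, t, trials, rng):
    """Fraction of trials in which |self-normalized sum| exceeds t."""
    exceed = 0
    for _ in range(trials):
        # Standard Cauchy draws via the inverse CDF: symmetric and heavy tailed.
        zs = [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
        if abs(self_normalized_sum(zs)) > t:
            exceed += 1
    return exceed / trials


rng = random.Random(0)
t = 2.0
frequency = tail_frequency(n=8, t=t, trials=20000, rng=rng)
bound = 2.0 * math.exp(-t * t / 2.0)
```

Even though the raw sum of Cauchy variables has extremely heavy tails, the normalization by $\sqrt{\sum_i Z_i^2}$ keeps the observed frequency below the sub-Gaussian bound.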

3.2 Subset Selection 

Merely searching for the lowest error solution over all subsets of, say, size q is prone to overfitting. 
Instead, we seek to take into account the empirical variance when searching over subsets of features. 
We now provide a data-dependent bound showing that the empirical variance can be utilized for a much 
sharper bound. In the next subsection, we discuss a greedy method for this search. 

Given some set of features $S$ of size $q$, let $X_1, \ldots, X_q$ be an orthonormal basis for this subspace
(i.e. $\mathbb{E}[X_i X_j]$ is 0 if $i \ne j$ and 1 if $i = j$). Note that we can put $S$ into this basis
as we have assumed knowledge of $\mathbb{E}[XX^\top]$. The best weight vector for this subspace is again
the covariance $[\mu_S]_i = \mathbb{E}[X_i Y]$. Define the (unbiased) empirical means and variances as follows:

$$[\hat\mu_S]_i = \frac{1}{n} \sum_{k=1}^{n} \mathbb{E}[X_i Y \mid d_k],$$

$$[\hat\sigma_S^2]_i = \frac{1}{n-1} \sum_{k=1}^{n} \left(\mathbb{E}[X_i Y \mid d_k] - [\hat\mu_S]_i\right)^2$$

We take $\hat\mu_S$ as the estimate of the weight vector on this subspace. We now provide our data dependent
generalization bound, in terms of an appropriate empirical variance. In particular, we are interested in
a generalization bound for all subsets of size $q$ out of a feature set of size $p$.

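Corollary 3. Assume the random vector $\mathbb{E}[XY \mid d] - \mathbb{E}[XY]$ is symmetric. Let $\delta > 0$.
Assume that our set of features $\mathcal{F}$ is of size $p$, and that $q p^q \le \frac{\delta}{2} e^{n/8}$.
For all subsets $S \subseteq \mathcal{F}$ of size $q$, we have:

$$\mathcal{L}(\hat\mu_S) - \mathcal{L}(\mu_S) \le \frac{4\left(2q \log p + \log\frac{2}{\delta}\right)}{n} \left(\sum_{i \in S} [\hat\sigma_S^2]_i\right)$$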

This bound is analogous to the usual bounds for regression where, instead of the sum of empirical variances,
we have the true variance (which is usually assumed to be constant in the idealized Gaussian noise
regression model¹). Crucially, the bound shows that we can robustly utilize the empirical variance when doing
our estimation. The implication of this is that we can design a much sharper procedure for testing if
a feature improves performance.

3.3 Practice: The T-Greedy Algorithm 

In practice, the natural methodology is to "greedily" choose a feature instead of searching all subsets, 
which usually consists of finding the feature which decreases the error the most. Instead, we introduce 
the T-greedy algorithm, a "stagewise" procedure for adding the feature which has the highest T-statistic 
(e.g. the goal is to add a feature in which we have the most confidence that the true error will be 
improved). 

There are a variety of greedy regression procedures, such as "stepwise", "stagewise", etc. We now
present a stagewise variant by considering covariances with our residual error $(Y - w \cdot X)$. Suppose
that our current weight vector is $w$ (on our current set of features). For each feature $X_i$, we compute
the empirical mean and variance:

$$\hat\mu_i = \frac{1}{n} \sum_{k=1}^{n} \mathbb{E}[X_i (Y - w \cdot X) \mid d_k],$$

$$\hat\sigma_i^2 = \frac{1}{n-1} \sum_{k=1}^{n} \left(\mathbb{E}[X_i (Y - w \cdot X) \mid d_k] - \hat\mu_i\right)^2$$

¹For the usual model, where $Y = \beta X + \eta$ where $\eta$ is Gaussian noise with variance $\sigma^2$. The risk bound above
is just ^ ^ , which is improved by a factor of q. We conjecture if we made the further assumption that the 
random vector $\mathbb{E}[X_i Y \mid d] - \mathbb{E}[X_i Y]$ is spherically symmetric, then the factor of $q$ can be removed.

Figure 2: Area under ROC (AUROC) (y-axis) for predicting digit '2' for both the greedy algorithm (shown in
red) and for our T-greedy algorithm (shown in blue), as we add more features (x-axis). Since we choose features
based on T-statistics (see last graph), our T-statistics are generally higher than those of the greedy algorithm.



Note that with a finite number of samples in each domain, we would simply use the empirical estimates
instead. Now we just add the feature with the highest T-statistic, i.e. add the feature:

$$i^* = \arg\max_i T_i$$

where $T_i = \hat\mu_i / (\hat\sigma_i / \sqrt{n})$, and the update to the weight on this feature is simply:

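$$w_{i^*} \leftarrow w_{i^*} + \frac{\hat\mu_{i^*}}{\frac{1}{n} \sum_{k=1}^{n} \mathbb{E}[X_{i^*}^2 \mid d_k]}$$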



Observe that this is actually a biased estimate of the optimal weight on this added feature. Technically,
our theory is only applicable to using unbiased estimates, where we would have $\mathbb{E}[X_i^2]$ in the
denominator. This is a minor distinction in practice, and with unlabeled data we could essentially run the
unbiased version. We should point out that stepwise variants are also possible.
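The stagewise procedure described in this subsection can be sketched as follows, assuming each training domain supplies its second-order statistics $\mathbb{E}[X_i Y \mid d_k]$ and $\mathbb{E}[X_i X_j \mid d_k]$. Several details here are assumptions rather than the paper's specification: features are ranked by $|T_i|$, an already selected feature is not revisited, the weight step divides by the across-domain average of $\mathbb{E}[X_i^2 \mid d_k]$ (the biased variant), and all names (`t_greedy`, `exy`, `exx`) are hypothetical.

```python
import math


def residual_covariances(w, exy_k, exx_k):
    """E[X_i (Y - w.X) | d_k] for every feature i, from one domain's statistics."""
    p = len(w)
    return [exy_k[i] - sum(exx_k[i][j] * w[j] for j in range(p)) for i in range(p)]


def t_greedy(exy, exx, num_steps):
    """Stagewise selection by largest |T|; returns (weights, selected indices)."""
    n, p = len(exy), len(exy[0])
    w = [0.0] * p
    selected = []
    for _ in range(num_steps):
        res = [residual_covariances(w, exy[k], exx[k]) for k in range(n)]
        best_i, best_t, best_mu = None, -1.0, 0.0
        for i in range(p):
            if i in selected:  # assumption: each feature is added at most once
                continue
            vals = [res[k][i] for k in range(n)]
            mu = sum(vals) / n
            var = sum((v - mu) ** 2 for v in vals) / (n - 1)
            if var > 0.0:
                t_stat = abs(mu) / math.sqrt(var / n)
            else:
                t_stat = math.inf if mu != 0.0 else 0.0
            if t_stat > best_t:
                best_i, best_t, best_mu = i, t_stat, mu
        if best_i is None:
            break
        # Biased step: divide by the empirical second moment of the chosen feature.
        second_moment = sum(exx[k][best_i][best_i] for k in range(n)) / n
        w[best_i] += best_mu / second_moment
        selected.append(best_i)
    return w, selected


# Four toy domains, two uncorrelated unit-variance features.
exx = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(4)]
exy = [[0.9, 0.30], [-0.1, 0.32], [0.8, 0.28], [0.0, 0.30]]
weights, order = t_greedy(exy, exx, num_steps=2)
```

In this toy input, feature 0 has the larger mean covariance (0.4 versus 0.3), so a procedure that only minimizes training error would pick it first; because its covariance swings widely across domains, the T-statistic ranks the stable feature 1 ahead of it.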



4 Experimental Results 

We now present results on the MNIST and CIFAR image datasets. The MNIST digit dataset contains
60,000 training and 10,000 test images of ten handwritten digits (0 to 9), with 28x28 pixels. In all
experiments, we use 10,000 digits (1,000 per class) for training and 10,000 digits for testing. Instead of
using raw pixel values, each image was represented by 2000 real-valued features that were extracted
using a deep belief network [13].

We also present results on the more challenging CIFAR image dataset [18], which contains images of
10 object categories: airplane, car, bird, cat, dog, deer, truck, ship, frog, and horse. As
with the MNIST dataset, we use 10,000 images (1,000 per class) for training and 10,000 images for
testing. Each image was also represented by 2000 real-valued features that were extracted using a deep
belief network [18]. We note that extreme variability in scale, viewpoint, and illumination, together with
cluttered backgrounds, makes the object recognition task for this dataset difficult.

In all experiments, we report the area under ROC (AUROC) metric of two different algorithms, which
we refer to as the greedy algorithm and our proposed T-greedy algorithm. The greedy algorithm chooses the
next feature which decreases the squared loss the most on the training set. The T-greedy algorithm,
on the other hand, chooses a feature with the largest T-statistic. For both methods, we report both
generalization error on our training or 'source' domains as well as generalization error on test or 'target'
domains. We do not focus on the issue of stopping but rather on robustness. There are a variety of
methods for stopping which we mention in the Discussion section.
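For readers unfamiliar with the metric, AUROC can be computed as the fraction of (positive, negative) pairs that the score ranks correctly, counting ties as half. The sketch below uses this standard pairwise definition; it is a generic illustration with a hypothetical function name, not the evaluation code used for the experiments.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparisons (ties count one half).

    labels: 1 for positive examples, anything else for negative examples.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y != 1]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for sp in positives:
        for sn in negatives:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


example = auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
chance = auroc([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
```

A constant score gives 0.5, which is the "chance" level drawn as a horizontal line in the figures.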

4.1 MNIST (2 vs. other) 

In our first experiment, shown as the leftmost plot in Fig. 2, we tested the ability of the proposed
algorithm to generalize to a new target domain: recognizing the digit '2' vs. the new, previously unseen
digit '9'. To this end, we created eight source domains: {'2' vs. '0'}, ..., {'2' vs. '8'}, where each
domain contained a balanced set of 2000 labeled training examples.² Our new target domain (as the
test set) {'2' vs. '9'} also contained a balanced set of 2000 examples.

²Remember, our key assumption is that the sampled domains are independent and that the source domains are
known.

Figure 3: Area under ROC (AUROC) (y-axis) for learning a similarity function for both the greedy algorithm
(shown in red) and for our T-greedy algorithm (shown in blue), as we add more features (x-axis). Since we choose
features based on T-statistics, our T-statistics are generally higher than those of the greedy algorithm.



Fig. 2, the leftmost plot, displays the evolution of the area under ROC (AUROC) metric for both the greedy
(red curves) and T-greedy (blue curves) algorithms. Note that an area of 0.5 corresponds to a random
classifier, shown on the graph as a horizontal line. The dashed curves correspond to generalization or
'test' performance on the source domains, whereas the solid curves display performance on the target
domain. Observe that after adding a few features, test performance of the greedy algorithm on the new
target domain (red solid curve) rapidly decreases. Test error on the source domains, however, keeps
improving, clearly demonstrating that no overfitting on the source domains is occurring. Hence, for the
greedy algorithm, the true errors on the source and target domains rapidly diverge.

This is in sharp contrast to the performance of the T-greedy algorithm. Even though performance of
the T-greedy algorithm on the source domains (blue dashed curve) is slightly worse (as expected, since
it is not as aggressively striving for source error minimization), the true AUROC on the source and
target domains diverges less rapidly; in particular, these curves start close together. Fig. 2 further
shows results for different source/target splits. We consistently observe that as we add a few features, the
T-greedy algorithm overfits much less on the target domain. This consistency is also seen in the leftmost
plot of Fig. 4, which displays results averaged over all splits of the source and target domains.

The rightmost plot of Fig. 2 also shows the T-statistic of the feature added to the model by each of the
algorithms. We only show one such figure since they all look similar.

We formulate the similarity learning problem in our regression setting as follows. Given two feature
vectors corresponding to two images, $\varphi(X^1)$ and $\varphi(X^2)$, we consider a linear regression function:

$$\hat{y} = \sum_i w_i \, \varphi_i(X^1) \, \varphi_i(X^2)$$

where we set $y = 1$ if two images have the same label (positive example), and $y = -1$ if two images
have different labels (negative example).

Fig. 3, top row, displays results on learning a similarity function for MNIST digits. In particular,
consider learning a similarity function on all the digits, but with digit '9' excluded. Similar to the
previous experiment, we constructed nine source domains (corresponding to digits 0 through 8). Each
domain contained 1000 positive and 1000 negative examples, where negative examples were randomly
sampled from the remaining digits in the source domain. Our target domain contained 1000 positive
examples of newly observed images of '9' and 1000 negative examples, randomly sampled from images
of 0 to 8.
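The construction of one similarity domain can be sketched as below: positive pairs join two images of the domain's digit with target $y = 1$, and negative pairs join an image of that digit with an image of a different digit with target $y = -1$. The exact sampling scheme (with replacement, anchored on the domain's digit) is an assumption made for this illustration, and `make_pairs` is a hypothetical helper name.

```python
import random


def make_pairs(labels, domain_label, num_each, rng):
    """Return (index_a, index_b, y) triples for one similarity domain.

    Positive pairs (y = 1) use two distinct images of `domain_label`;
    negative pairs (y = -1) use one image of `domain_label` and one of another label.
    """
    same = [i for i, lab in enumerate(labels) if lab == domain_label]
    other = [i for i, lab in enumerate(labels) if lab != domain_label]
    if len(same) < 2 or not other:
        raise ValueError("not enough images to form pairs for this domain")
    pairs = []
    for _ in range(num_each):
        a, b = rng.sample(same, 2)
        pairs.append((a, b, 1))
    for _ in range(num_each):
        pairs.append((rng.choice(same), rng.choice(other), -1))
    return pairs


rng = random.Random(0)
toy_labels = [0, 0, 0, 1, 1, 2, 2]
toy_pairs = make_pairs(toy_labels, domain_label=0, num_each=5, rng=rng)
```

Each resulting triple would then be mapped to a regression example by combining the two images' feature vectors, with the target given by the third entry.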

4.2 Learning similarity function 

We now consider a more demanding task of learning a similarity function between two images. A
good similarity function can provide insight into how high-dimensional data is organized and can
significantly improve the performance of many machine learning algorithms that are based on computing
a similarity metric. Our goal is to learn a similarity function that not only works well for objects that
are part of the training set, but also works well for new objects that we may have never seen before: a
widely studied problem known as "zero-shot" learning.

Figure 4: Area under ROC averaged over all splits of source/target domains for predicting digit 2 (left), learning
similarity function for MNIST digits (middle) and CIFAR images (right).

Fig. 3, top leftmost plot, shows that the generalization errors of the greedy algorithm on the source and
target domains rapidly diverge. The T-greedy algorithm, on the other hand, is able to select up to 25 reliable
features that help us generalize well to the new target domain. Fig. 3 further displays performance
results when generalizing a similarity function to different target domains. Again, the rightmost plot
shows the value of the added T-statistic for one of these plots.

Finally, we experimented with learning a similarity function for the more challenging CIFAR image dataset.
Similar to the results on the MNIST dataset, Fig. 3, bottom row, shows that the T-greedy algorithm is
able to consistently pick up to 50 robust features that are useful for transfer to a new domain (note the
difference in scale on the x-axis, which now goes to 300 features). The greedy algorithm, however,
barely improves upon making random predictions.

We have focused attention on the individual domains to help drive home how variable each domain is
from the others. But it is sometimes hard to see the signal amongst all this noise, so we also provide
averaged versions of the AUROC curves (Fig. 4). The T-greedy algorithm is able to pick up many
more robust features and overfits far less on the target domain (difference in blue-dashed and blue-solid
curves). The greedy algorithm's test error diverges after adding only a handful of features. Almost
immediately we see a big gap in the error on the source and target domains (difference in red-dashed
and red-solid curves).

5 Discussion 

All experiments demonstrate that the T-greedy algorithm has better correspondence between training
AUROC and testing AUROC. The training and testing AUROC curves start out with about the same
value. This is particularly striking in the averaged curves, shown in Fig. 4. So by
looking only at the training curves one can get a good estimate of the generalization performance. As
expected, eventually overfitting occurs, since the training AUROC continues to improve whereas the
testing AUROC decreases. However, even then it is possible to get a handle on how to use our method (e.g.
when to stop). One option is to simply keep yet another domain held out for cross-validation and cycle
through. Alternatively, we can use properties of the T-statistic to get a handle on when to stop (e.g.
when the T-statistic is behaving like chance). Here, Bonferroni can also be used as a heuristic to decide
how many variables to use. Again, this is made easier by the fact that the curves are close.
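One way the Bonferroni heuristic could look in code is sketched below: a feature is accepted only if the two-sided p-value of its T-statistic, under a Student's t reference distribution with $n - 1$ degrees of freedom, falls below $\alpha$ divided by the number of candidate features. The reference distribution, the level $\alpha = 0.05$, the Simpson's-rule integration, and the function names are all assumptions made for this illustration; the paper does not spell out a specific rule.

```python
import math


def t_density(x, dof):
    """Density of Student's t distribution with `dof` degrees of freedom."""
    log_c = (math.lgamma((dof + 1) / 2.0) - math.lgamma(dof / 2.0)
             - 0.5 * math.log(dof * math.pi))
    return math.exp(log_c - (dof + 1) / 2.0 * math.log(1.0 + x * x / dof))


def two_sided_p_value(t_stat, dof, steps=2000):
    """P(|T| > |t_stat|) via Simpson's rule on [0, |t_stat|]; `steps` must be even."""
    a = abs(t_stat)
    if a == 0.0:
        return 1.0
    h = a / steps
    total = t_density(0.0, dof) + t_density(a, dof)
    for j in range(1, steps):
        total += (4.0 if j % 2 else 2.0) * t_density(j * h, dof)
    central_mass = total * h / 3.0  # P(0 < T < |t_stat|)
    return max(0.0, 1.0 - 2.0 * central_mass)


def passes_bonferroni(t_stat, n_domains, num_candidates, alpha=0.05):
    """Accept a feature only if its p-value beats alpha / num_candidates."""
    return two_sided_p_value(t_stat, n_domains - 1) < alpha / num_candidates


single_test = passes_bonferroni(2.5, n_domains=8, num_candidates=1)
many_tests = passes_bonferroni(2.5, n_domains=8, num_candidates=2000)
```

With eight domains, a T-statistic of 2.5 is significant as a single test but far from significant once 2000 candidate features are accounted for, which is why a selection rule under such a correction would stop early.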

We also observe that the variability between domains is much greater than the variability within any
given domain (Figs. 2, 3, and 4 all show this variability). Classical statistics assumes that each error
is independent (if just merged across all the domains), but we see from the plots that each domain behaves
idiosyncratically. Sometimes they overfit after a few variables, sometimes they continue to improve.
This means that using more observations from the domains we have already studied is not informative
of how we will extrapolate to new domains. Such small sample sizes were the original motivation for
Gosset to come up with his Student's T-statistic. Note that we do not have many degrees of freedom, but
we can still extract as much information as possible from the data we have. What we see from our analysis
and experiments is that this information can still be substantial.

Acknowledgments RS is supported by NSERC, and NTT Communication Sciences Laboratory.

A Appendix

First, let us prove Theorem 2 (also, see Theorem 2.15 in [10]).

Proof. (of Theorem 2). Let $\epsilon_i$ be Rademacher random variables (i.e. independent random variables
which take values uniformly in $\{-1, 1\}$). Since each $Z_i$ is symmetric, we have that the distribution of
$Z_i$ is identical to the distribution of $\epsilon_i Z_i$. Hence, we have that:

$$\Pr\left[\frac{\left|\sum_{i=1}^{n} Z_i\right|}{\sqrt{\sum_{i=1}^{n} Z_i^2}} > t\right] = \Pr\left[\frac{\left|\sum_{i=1}^{n} \epsilon_i Z_i\right|}{\sqrt{\sum_{i=1}^{n} Z_i^2}} > t\right]$$

Now we bound this latter quantity for every realization of $Z_i$. Consider a fixed set of values
$z_1, \ldots, z_n$ (some realization of $Z_1, \ldots, Z_n$). For these fixed values, let us now bound the
probability:

$$\Pr\left[\frac{\left|\sum_{i=1}^{n} \epsilon_i z_i\right|}{\sqrt{\sum_{i=1}^{n} z_i^2}} > t\right] = \Pr\left[\left|\sum_{i=1}^{n} \epsilon_i z_i\right| > t \sqrt{\sum_{i=1}^{n} z_i^2}\right] \le 2\exp\left(-\frac{2 t^2 \sum_{i=1}^{n} z_i^2}{\sum_{i=1}^{n} (2 z_i)^2}\right) = 2\exp\left(-\frac{t^2}{2}\right)$$

where the second to last step is by Hoeffding's inequality (where the only randomness is due to the
$\epsilon_i$). To see this, note that we are adding the independent variables $\epsilon_i z_i$, which are
mean 0 and bounded in magnitude by $|z_i|$. □

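Now we prove Theorem 1.

Proof. (of Theorem 1). For symmetric, mean 0, independent $Z_i$, define:

$$\hat\mu = \frac{1}{n} \sum_{i=1}^{n} Z_i, \qquad \hat\sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Z_i - \hat\mu)^2$$

The T-statistic is then defined as:

$$T = \frac{\sqrt{n}\, \hat\mu}{\hat\sigma}$$

Define the related quantity:

$$V = \frac{\sum_{i=1}^{n} Z_i}{\sqrt{\sum_{i=1}^{n} Z_i^2}}$$

and note that:

$$T = V \sqrt{\frac{n-1}{n - V^2}}$$

Also, one can show that $|T| > t$ if and only if $|V| > t \sqrt{\frac{n}{n - 1 + t^2}}$. Hence, using the
bound on self-normalized sums,

$$\Pr\left[|T| > t\right] = \Pr\left[|V| > t \sqrt{\frac{n}{n-1+t^2}}\right] \le 2\exp\left(-\frac{1}{2} \cdot \frac{n}{n-1} \cdot \frac{t^2}{1 + \frac{t^2}{n-1}}\right) \le 2\exp\left(-\frac{1}{2} \cdot \frac{t^2}{1 + \frac{t^2}{n-1}}\right)$$

Now let us choose $t = \sqrt{4 \log \frac{2|\mathcal{F}|}{\delta}}$. By assumption on the size of
$\mathcal{F}$, we have that $t^2 \le n/2$, and so $\frac{t^2}{n-1} \le \frac{n}{2(n-1)} \le 1$ (since $n \ge 2$).
Hence,

$$\Pr\left[|T| > \sqrt{4 \log \frac{2|\mathcal{F}|}{\delta}}\right] \le 2\exp\left(-\frac{1}{2} \cdot \frac{t^2}{2}\right) = \frac{\delta}{|\mathcal{F}|}$$

Our result now follows by the union bound (over all $|\mathcal{F}|$ features).

□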
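The algebraic link between the T-statistic and the self-normalized sum used in this proof, $T = V\sqrt{(n-1)/(n - V^2)}$, together with the equivalence of the events $|T| > t$ and $|V| > t\sqrt{n/(n-1+t^2)}$, can be checked numerically. The sketch below does this on a random sample; the sample, the seed, and the function names are arbitrary choices for the illustration.

```python
import math
import random


def t_statistic(zs):
    """sqrt(n) * mean / unbiased standard deviation."""
    n = len(zs)
    mu = sum(zs) / n
    var = sum((z - mu) ** 2 for z in zs) / (n - 1)
    return mu / math.sqrt(var / n)


def self_normalized(zs):
    """V = sum(z) / sqrt(sum(z^2))."""
    return sum(zs) / math.sqrt(sum(z * z for z in zs))


def t_from_v(v, n):
    """Recover the T-statistic from V via T = V * sqrt((n - 1) / (n - V^2))."""
    return v * math.sqrt((n - 1) / (n - v * v))


rng = random.Random(1)
zs = [rng.gauss(0.0, 1.0) for _ in range(8)]
n = len(zs)
t_direct = t_statistic(zs)
v = self_normalized(zs)
t_via_v = t_from_v(v, n)

# The two tail events coincide for any threshold t > 0.
t = 1.5
same_event = (abs(t_direct) > t) == (abs(v) > t * math.sqrt(n / (n - 1 + t * t)))
```

The two routes to the T-statistic agree up to floating point error, which is what lets a tail bound on $V$ be converted into a tail bound on $T$.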



Our corollary now follows:

Proof. (of Corollary 3). For any subset, the regret is:

$$\mathcal{L}(\hat\mu_S) - \mathcal{L}(\mu_S) = \sum_{i \in S} \left([\hat\mu_S]_i - [\mu_S]_i\right)^2$$

Now note that there are no more than $p^q$ possible subsets. Also, each subset comes with its own basis.
So let us demand confidence on all $q p^q$ possible basis elements. So we use Theorem 1 with a set of size
$q p^q$ features (note that the log of the size of this set is bounded by $2q \log p$). Our theorem now follows
by summing over the errors. □